REMARKS

I. Introduction

Applicant respectfully requests reconsideration of the present application in view of 
the foregoing amendments and in view of the reasons that follow. 

Claims 12-15, 22-25, 27-28, 30-31, 33-34, 36-37, 44-47, 54-57, 64-67 and 70-92 are 
canceled. The cancellation of claims does not constitute acquiescence in the propriety of any 
rejection set forth by the Examiner. Applicant(s) reserve the right to pursue the subject matter 
of the canceled claims in subsequent divisional applications. 

Claims 1-5 and 68 are currently amended. Claims 6-11, 16-21, 26, 29, 32, 35, 38-43, 
48-53, 58-63 and were withdrawn. 

This amendment adds, changes and/or deletes claims in this application. A detailed 
listing of all claims that are, or were, in the application, irrespective of whether the claim(s) 
remain under examination in the application, is presented, with an appropriate defined status 
identifier. 

Upon entry of this Amendment, claims 1-21, 26, 29, 32, 35, 38-43, 48-53, 58-63, 68- 
69 and 93-95 will remain pending in the application. 

Because the foregoing amendments do not introduce new matter, entry thereof by the 
Examiner is respectfully requested. 

II. Response to Issues Raised by Examiner in Outstanding Office Action 

a. Objection to Oath/Declaration 

The Office maintains an objection to the oath based on non-initialed or non-dated 
alterations. To support this contention the Office cites 37 CFR 1.52(c). This rule provides: 

(c)(1) Any interlineation, erasure, cancellation or other alteration of the application 
papers filed must be made before the signing of any accompanying oath or declaration 
pursuant to § 1.63 referring to those application papers and should be dated and 
initialed or signed by the applicant on the same sheet of paper. Application papers
containing alterations made after the signing of an oath or declaration referring to
those application papers must be supported by a supplemental oath or declaration
under § 1.67. In either situation, a substitute specification (§ 1.125) is required if the
application papers do not comply with paragraphs (a) and (b) of this section. 

(2) After the signing of the oath or declaration referring to the application papers, 
amendments may only be made in the manner provided by § 1.121. 

(3) Notwithstanding the provisions of this paragraph, if an oath or declaration is a 
copy of the oath or declaration from a prior application, the application for which 
such copy is submitted may contain alterations that do not introduce matter that would 
have been new matter in the prior application. 

Applicants note that rule 1.52 requires pages with changes to be dated and signed by the 
Applicant. The changes to the declaration occur in the signature block and are immediately 
signed and dated below the changes. As the inventor has signed and dated the page with the 
changes, the authenticity of the changes is not in doubt and Applicants request withdrawal of 
this objection. 

b. Claim Rejections - 35 U.S.C. § 112, First Paragraph

Claims 68-69 and 93 are rejected under 35 U.S.C. § 112, first paragraph, as
containing subject matter which was not described in the specification in such a way as to
reasonably convey to one skilled in the relevant art that the inventor(s), at the time the
application was filed, had possession of the claimed invention. The Office maintains, "The
claimed invention is directed to an isolated or purified peptide (SEQ ID NO:3), that has at
least 80% or at least 90% sequence identity to the peptide as claimed. The claims encompass
a genus of variants that are highly variable. A skilled artisan cannot envision the detailed
chemical structures for all the variants encompassed by the claims." Office Action, p. 4.

Applicants disagree with the Office. However, in order to further prosecution,
Applicants have canceled claims 69 and 93 providing for percentage identity with the
peptides of claim 1. Claim 68 is drawn to peptides with 1-5 conservative amino acid
substitutions. Support for this claim is found in paragraph [0029] and original claim 68.
Additionally, paragraph [0056] describes that conserved amino acid substitutions involve
replacing one or more amino acids of the protein of the invention with amino acids of similar
size and/or hydrophobicity characteristics. As of the time of filing, groups of amino
acids considered appropriate for substitutions had been well studied and understood by those
of skill in the art. For example, William Taylor, "The Classification of Amino Acid
Conservation", J. Theor. Biol. 119, 205-218 (1986), and Bordo et al., J. Mol. Biol. 217,
721-729 (1991) (See Attached), both describe the classification of amino acids well before the
filing date of the application. A person of skill in the art could readily identify conservative
amino acid substitutions from these publications and common knowledge at the time of filing. Based on
these disclosures, replacement of 1-5 amino acids with conservative substitutions is described
in the application and a person of skill in the art would recognize that Applicants were in
possession of the claimed invention at the time of filing.

Claims 68-69 and 93 are rejected under 35 U.S.C. § 112, first paragraph, because the
specification, while being enabling for the proteins set forth in SEQ ID NO: 3, does not
reasonably provide enablement for any peptide having at least 80% or 90% sequence
homology to SEQ ID NO: 3.

As noted above, Applicants disagree with the Office. However, in order to further 
prosecution, Applicants have canceled claims 69 and 93 providing for percentage identity 
with the peptides of claim 1. Claim 68 is drawn to peptides with 1-5 conservative amino acid
substitutions. 

The fact that experimentation may be complex does not necessarily make it undue, if 
the art typically engages in such experimentation. In re Certain Limited-Charge Cell Culture
Microcarriers, 221 USPQ 1165, 1174 (Int'l Trade Comm'n 1983), aff'd sub nom.
Massachusetts Institute of Technology v. A.B. Fortia, 774 F.2d 1104, 227 USPQ 428 (Fed.
Cir. 1985). See also In re Wands, 858 F.2d at 737, 8 USPQ2d at 1404.

As long as the specification discloses at least one method for making and using the 
claimed invention that bears a reasonable correlation to the entire scope of the claim, then the 
enablement requirement of 35 U.S.C. 112 is satisfied. In re Fisher, 427 F.2d 833, 839, 166 
USPQ 18, 24 (CCPA 1970). A person of skill in the art would be able to practice the claimed 
invention using methods described within the specification. 

Although the Office maintains that substitution of conserved amino acids is not 
enabled by the specification, it is unclear what aspect of the experimentation is "undue 
experimentation" under the current case law for the standard of enablement. The currently 
claimed invention is directed to short isolated peptides and methods of utilizing these peptides.
Preparation of such peptides is well known to one of skill in the art by a multitude of
techniques including standard peptide synthetic technology. The Specification outlines all of
the methods and tests necessary in order to use these peptides for testing activity. See the 
methods outlined in the Application and used for some of the described peptides in Examples 
1-14. The specification provides sufficient guidance for following the claimed methods to 
determine if a peptide affects the rate of degradation of type II collagen or the rate of 
chondrocyte hypertrophy. As noted above, as long as the specification discloses at least one 
method for making and using the claimed invention that bears a reasonable correlation to the 
entire scope of the claim, then the enablement requirement of 35 U.S.C. 112 is satisfied. In re
Fisher, 427 F.2d 833, 839, 166 USPQ 18, 24 (CCPA 1970). Applicant respectfully requests 
reconsideration and withdrawal of the rejection. 

c. Claim Rejections - 35 U.S.C. § 102 

Claims 1, 69, and 93 are rejected under 35 U.S.C. § 102(b) as being anticipated by 
Qvist et al. (US Patent No. 6,110,689, August 29, 2000) and Claims 1, 68-69, and 93 over
Shriners Hospitals for Crippled Children (WO 94/14070, 1994), cited on IDS filed April 30, 
2004, based on the open language in the claim which reads on all of SEQ ID NO: 3 
embedded in a longer sequence. Office Action, pp. 11-12 and 13-14. Applicants have
canceled claims 69 and 93. Regarding claim 1, Applicants have amended the claims to recite 
isolated peptides consisting of the provided amino acid sequences. The claims do not 
encompass other peptides comprising SEQ ID NO: 3. If Applicant's explanation and 
amendments are not sufficient to satisfy the Office, Applicant requests clarification regarding 
an Amendment to specify this understanding. 

In addition, there is no disclosure, teaching or suggestion in Qvist of isolated or 
purified peptides of SEQ ID NO: 3, the hydroxylated versions of this peptide with an
additional glycine at the N-terminus (SEQ ID NOs: 6-9), or any of the other peptides of the
present invention. 

Likewise, WO 94/14070 does not teach or suggest an "isolated or purified" peptide of 
SEQ ID NO: 3, nor the other peptides of claim 1. Moreover, there is no teaching or
suggestion in either of these references for a person of ordinary skill in the art to conclude
that the peptides of the present invention would modulate and regulate cell differentiation as
well as the degradation of collagen. 

Applicant respectfully requests reconsideration and withdrawal of the rejection. 

CONCLUSION



The present application is now in condition for allowance. Favorable reconsideration 
of the application as amended is respectfully requested. 

It is acknowledged that the foregoing amendments are submitted after final rejection. 
However, because the amendments do not introduce new matter or raise new issues, and 
because the amendments either place the application in condition for allowance or at least in 
better condition for appeal, entry thereof by the Examiner is respectfully requested. 

The Examiner is invited to contact the undersigned by telephone if it is felt that a 
telephone interview would advance the prosecution of the present application. 

The Commissioner is hereby authorized to charge any additional fees which may be 
required regarding this application under 37 C.F.R. §§ 1.16-1.17, or credit any overpayment, 
to Deposit Account No. 19-0741. Should no proper payment be enclosed herewith, as by a
check being in the wrong amount, unsigned, post-dated, otherwise improper or informal or
even entirely missing, the Commissioner is authorized to charge the unpaid amount to
Deposit Account No. 19-0741. If any extensions of time are needed for timely acceptance of
papers submitted herewith, Applicant(s) hereby petition(s) for such extension under 37
C.F.R. § 1.136 and authorizes payment of any such extension fees to Deposit Account No.
19-0741.



Respectfully submitted,

Date

By

FOLEY & LARDNER LLP
Customer Number: 22428
Telephone: (202) 672-5569
Facsimile: (202) 672-5399

Stephen B. Maebius
Attorney for Applicant
Registration No. 35,264

The Classification of Amino Acid Conservation

William Ramsay Taylor 

Laboratory of Molecular Biology, Dept of Crystallography, Birkbeck College, 
London WC1E 7HX, U.K.

(Received 20 September 1985, and in revised form 1 November 1985) 

A classification of amino acid type is described which is based on a synthesis of
physico-chemical and mutation data. This is organised in the form of a Venn diagram
from which sub-sets are derived that include groups of amino acids likely to be
conserved for similar structural reasons. These sets are used to describe conservation
in aligned sequences by allocating to each position the smallest set that contains
all the residue types brought together by alignment. This minimal set assignment
provides a simple way of reducing the information contained in a sequence alignment
to a form which can be analysed by computer yet remains readable.

1. Introduction

The high level of expertise current in nucleic acid research has led to the revelation
of a large number of protein sequences and the ability specifically to alter these in
a controlled way. New sequences often exhibit a close homology with proteins
which have had their structure determined crystallographically and, using advanced
computer graphic facilities, it is possible theoretically to alter the amino acid side
chains of the known structure to represent the sequence of unknown structure (e.g.
Blundell et al., 1983). The new techniques are used increasingly to design proteins
with altered properties using the methodology of site-specific mutagenesis.

These activities require a good understanding of the basic principles of protein
structure and in particular it is necessary to anticipate the structural effect of
introducing a new amino acid into a known structure. This assessment is often based
on the likelihood matrix of amino acid mutabilities derived by Dayhoff et al. (1972,
1978) or on the number of nucleotide base changes required to [illegible].
Such measures, however, ignore [illegible]
relevant to the local structural environment or known function [illegible]
in building hypothetical structures and designing new mutants it is these details
[illegible]

This paper considers measures of amino acid relatedness in common use with
the aim of extracting from them features which will best assist a protein engineer
faced with the problem of making a mutation and assessing a sequence alignment.
These features are represented as groupings of amino acids (sets), which is a form
that retains a descriptive quality yet allows quantitative manipulations using the
formalism of set logic.

2. Measures of Amino Acid Relatedness

(A) MUTATION DATA 

For every pair of the 20 naturally occurring amino acids Dayhoff et al. (1972)
have determined the probability (or odds) that the mutation will occur in either 
direction. This matrix was most clearly presented by Sander & Schulz (1979) (see 
also Schulz & Schirmer, 1979) in a form where the entries have been ordered to 
bring frequently exchanging amino acids together. Even in its ordered form it is 
still difficult fully to appreciate the information contained in Dayhoff's matrix.
However, using the technique of multi-dimensional scaling, French & Robson (1983) 
reduced the matrix to a two dimensional plot, in which frequently exchanging amino 
acids are closest together (see Fig. 1 (a)). A similar diagram (Fig. 1(b)) was produced 
by minimising the deviation from a 2-D structure in which the entries of Dayhoff's
matrix represent the inverse of ideal target distances between pairs of amino acids 
(Taylor, 1981). Both these figures are roughly elliptical and projecting the amino 
acids onto the circumference of each ellipse produces an even simpler representation 
with no great loss of information. In this form the long axis of the ellipse corresponds 
to molecular volume while the short axis corresponds to hydrophobicity. These
simplifications are presented in Figs 1(c) and 1(d). The cyclic order of amino acids 
obtained from this simplification is almost the same as that derived by Swanson 
(1984) (see Fig. 1(e)). 

All the above representations of Dayhoff's matrix indicate that it can be largely
accounted for by the effect of only two determining factors: hydrophobicity and
size. This remarkable observation must obviously dominate any attempt to codify 
amino acid conservation. 



(B) PHYSICAL DATA 

Physico-chemical properties 

The wide variety of physico-chemical properties manifest in the amino acid side- 
chains has been thoroughly considered by Sneath (1966). These have been adopted, 
or deduced independently, by others and used as a basis for considering relatedness 
between protein sequences. McLachlan (1972) summarizes these relationships by 
assigning a value to each transition between pairs of amino acids. These scores are 
presented graphically in Fig. 2 using the circle of amino acids derived from Dayhoff's
matrix as a frame on which lines connect related amino acids. All high scoring 
transitions and most other connections on the graph are local, indicating agreement 
between chemistry and mutability. The less local connections mainly join hydro- 
phobic residues, which, with the exception of proline, are relatively adjacent on the 
less idealised representations of Dayhoff's matrix (Fig. 1). The missing links are,
perhaps, more revealing: there is only a weak link between Tyr and Trp which are 
strongly tied in Dayhoff's mutation matrix and there is no connection between the
adjacent negatively charged residues and Gly and Pro. 

Measurement of properties cannot determine a common scale without reference 
to protein sequences and structures. The idea of such a scaling can be appreciated 

Fig. 1. Representations of Dayhoff's mutation odds matrix. (a) Projection of the matrix by multi-
dimensional scaling (adapted from Robson & French (1983)). Amino acids which are close together
exchange frequently. (b) The equivalent diagram to (a) produced by pseudo-energy minimization to a
point at which each amino acid lies in a position which gives the minimum sum of squares over the
distance equation shown. (c) and (d) Idealizations of (a) and (b) respectively. The properties associated
with each quadrant are indicated with the property of lesser importance bracketed. (e) A further
idealization of the two plots which are constrained to a circle. Ambiguities in the cyclic order have been
reconciled by consideration of both original plots. The resulting order agrees with that obtained by
Swanson (1984) except for the exchange of Arg and His as indicated by an arrow. (This probably arises
from Swanson's use of the Dayhoff (1978) revised matrix.) Swanson's nomenclature for the quadrant
characteristics is also indicated. The one letter and three letter codes for the amino acids are as follows:
Gly G, Ala A, Val V, Leu L, Ile I, Ser S, Thr T, Asp D, Glu E, Asn N, Gln Q, Lys K, His H, Arg R,
Phe F, Tyr Y, Trp W, Cys C, Met M, Pro P, asx b, glx z.

Fig. 2. Chemical relatedness of the amino acids as quantified by McLachlan (1972) displayed on the
[illegible] Gly & Pro and Glu & Asp.

from the relative lengths of the ellipses derived from Dayhoff's matrix (see Figs
1(c) and 1(d)) in which the longer axis is associated with change of size, indicating
that this, on average, is dominant over hydrophobicity. Grantham (1974) scaled
three properties to McLachlan's (1971) matrix of amino acid substitution frequencies.
These were volume, polarity and the fraction of carbon in the side-chain
(composition). He found the most dominant property was again volume followed
by polarity.

Secondary structure propensities 

From statistical analysis of the sequences of proteins of known structure,
propensities to adopt a secondary structure have been determined for each amino acid
(e.g. Chou & Fasman, 1974; Garnier et al., 1979). These preferences have been used
to account for clusters of amino acids which are unexpected on a physico-chemical
basis. A clear example is the close association of G, P, D and E (see Fig. 1). These
associate because of a propensity to lie in sharply turning regions on the surface
of the protein: Gly, because of the flexibility it imparts to the local chain; Pro,
because of the built-in turn configuration created by its back-bonding side chain;
and Asp and Glu because of a requirement to expose their charges to solvent.
Robson & French (1983) indicate other instances including the close association of
Glu and Ala, both of which favour α-helical structure, and the tight association of
L, I, M, V and, to a lesser extent, F and Y, which tend towards β-structure.

3. Venn Diagram of Amino Acid Sets 
The idea of using a Venn diagram to represent the different relationships among 
the amino acids was adopted from Dickerson & Geis (1969) and extended to 
incorporate some of the observations discussed above. The overall layout of the 
diagram (Fig. 3(a)) was based on the 2-D arrangement derived from Dayhoff's
mutation matrix. Amino acids were then displaced (by as little as possible) from
this arrangement to form groups of residues related by common physico-chemical
properties. 

(A) SIZE AND HYDROPHOBICITY 

The major sets group the amino acids by size and hydrophobicity: both properties 
which were seen to dominate the structure of Dayhoff's matrix. Two overlapping
sets were used to describe hydrophobicity. One was defined as all amino acids which 
have a polar group in their side-chain and is referred to as polar. The second group 
is less well defined and contains the amino acids which were considered to be 
hydrophobic. This set contains some amino acids which have polar side-chains. 
These consequently lie in the intersecting region of the two sets which can be 
considered the set of amino acids which are ambivalent to water. The inclusion of 
Lys in this set is justified by its long aliphatic side-chain which has been observed 
(Cohen et al., 1982) to extend from a buried location and expose the terminal charge.

The location of Pro in Fig. 1 conflicts with its hydrophobic character. It was,
thus, left unclassified by the two sets hydrophobic and polar. Similarly, Gly is often 
considered to lack a side-chain and be consequently unclassifiable in hydrophobic 
terms (e.g. Rees & Sternberg 1984). However, as Gly is often found buried in the 
interior of proteins it was classified as hydrophobic. 

The volume of the side-chain was considered to be sufficiently important to justify 
classification by two sets. The larger of these, called small, contains the nine smallest
amino acids by side-chain volume (Klapper, 1971), each less than 60 Å³. A subset
of this, called tiny, includes the four acids with less than three (non-H) side-chain
atoms, all of which are smaller than 35 Å³.

The relationship of Cys to the sets defined above is rather ambiguous. Although 
its side-chain has only two atoms, the sulphur atom is relatively large, placing it on 
the Tiny-Small borderline. Its classification is further complicated by the occurrence 
of the sulphur in two oxidation states. The reduced form contains a polarizable
S-H bond which suggests a similarity to Ser (O-H), with which it is associated
in McLachlan's table (Fig. 2). On formation of a disulphide bond, however, this
property is lost, placing the residue more firmly in the hydrophobic camp. The
associated loss of conformational freedom is difficult to assess but may be associated
with an effective increase in volume as the linked residue cannot accommodate as 
easily to structural fluctuations. Poor packing in the hydrophobic core of the 

Fig. 3. (a) The Venn diagram shows the relationship of the 20 naturally occurring amino acids to a
selection of physico-chemical properties which are important in the determination of protein tertiary
structure. The diagram is dominated by properties relating to size and hydrophobicity. The amino acids
are divided into two major sets, one containing all amino acids which contain a polar group (polar)
and a set which exhibit a hydrophobic effect (hydrophobic). A third major set, small, is defined by size
and contains the nine smallest amino acids. Within this is an inner set of smaller residues, tiny, which
have at most two side-chain atoms. The location of Cys is ambiguous as the reduced form (CH) has
similar properties to serine, while the oxidized form (Css) may be more equivalent to Val. Other sets
include full-charge (referred to as charged) which contains the subset positive (negative is defined by
implication) and aromatic and aliphatic. The latter set is not as general as the name implies and contains
only residues containing a branched aliphatic side-chain. Because of its unique backbone properties,
proline was excluded from the main body of the diagram. An equivalent exclusive position is suggested
for Gly by a small G. (b) An alternate representation of the relationships. A network of amino acids is
formed by connecting those which differ by no more than two properties in the Venn diagram. Pairs
which share the same subset are connected by a heavy bar, those with only one different property by a
bold line and those which differ by two properties by a fine line. To improve clarity a few of the latter
connections are omitted. Many of these only connect to one of an unresolved pair (e.g. I and L) and
it can be assumed that the connection is made to both. Because of its unique position, Pro is able to
make quite long range connections some of which are indicated by broken lines.

immunoglobulins near the intra-molecular disulphide link is suggestive of this 
(Cohen et al., 1981; Taylor, 1981). Considering all these aspects, two locations are
suggested for Cys on the Venn diagram. However, a location in the sub-set containing 
Thr is also possible. 

(B) OTHER SETS

The remaining set allocations are based on obvious physico-chemical properties. 
These include aromatic (ring containing side-chains) and aliphatic. The latter set,
however, includes only amino acids with branched aliphatic side-chains and largely
reflects the frequency with which this type of residue is found in β-pleated sheet
structure.

The set of charged amino acids contains only those which are normally (or often)
fully ionized. The subset positive is included, with negative defined by implication.

For simplicity, as few sets as possible were introduced, yet even these produce 
almost complete segregation of the amino acids with only Y-W, I-L and A-G 
grouped in the same sub-sets. Additional sets can easily be imagined but these often 
only create further distinction between residues which are already segregated. 
However, an important property not well represented is hydrogen-bonding ability. 
To distinguish the sets of hydrogen-bond donors and acceptors on the Venn diagram
(Fig. 4) would greatly reduce clarity; they are thus indicated separately in Figs 4(c)
and 4(d).
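
The set definitions of this section lend themselves to a direct encoding. The following is a minimal sketch in Python, not a transcription of Fig. 3(a): the membership of each set is an assumption reconstructed from the prose (for example, Gly counted as hydrophobic, Pro left out of both hydrophobic and polar, Lys in both), and the ambiguous Cys is placed in a single location for simplicity. The names SETS, ambivalent and properties are hypothetical.

```python
# Illustrative encoding of the amino acid sets described in Section 3.
# ASSUMPTION: membership is reconstructed from the text, not read off Fig. 3(a).
# Cys is ambiguous in the text; here it is placed once (hydrophobic, small, tiny).
SETS = {
    "hydrophobic": frozenset("ACFGHIKLMTVWY"),
    "polar":       frozenset("DEHKNQRSTWY"),
    "small":       frozenset("ACDGNPSTV"),   # the nine smallest residues
    "tiny":        frozenset("ACGS"),        # subset of small
    "aromatic":    frozenset("FHWY"),
    "aliphatic":   frozenset("ILV"),         # branched aliphatic side-chains only
    "charged":     frozenset("DEHKR"),
    "positive":    frozenset("HKR"),
}

# "negative" is defined by implication: charged but not positive.
SETS["negative"] = SETS["charged"] - SETS["positive"]

# Residues in both major sets, i.e. those "ambivalent to water".
ambivalent = SETS["hydrophobic"] & SETS["polar"]


def properties(residue: str) -> frozenset:
    """Return the names of all sets containing the one-letter residue code."""
    return frozenset(name for name, members in SETS.items() if residue in members)


if __name__ == "__main__":
    print("ambivalent:", "".join(sorted(ambivalent)))
    print("Lys properties:", sorted(properties("K")))
    print("Pro properties:", sorted(properties("P")))
```

Using frozenset keeps the set logic of the text (intersection, union, difference) available through the ordinary operators, which is the only design point the sketch is meant to illustrate.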

(C) NETWORK REPRESENTATION

A useful representation of the Venn diagram can be made by removing the set 
boundaries and connecting adjacent residues to form a network (Fig. 3(b)) in which 
no connected pair differ by more than two properties. In this form the structure is 
more easily compared to other representations and the circular arrangement of 
amino acids corresponding to Fig. 1 is readily apparent. 

The most significant deviation from the form of Dayhoff's matrix is the separation
between the negatively charged amino acids and Pro and Gly. This feature corre- 
sponds more to the physico-chemical relationships defined by McLachlan (Fig. 2). 
However, leaving Pro (and perhaps Gly) unclassified with respect to hydrophobicity
allows connections to be made with both hydrophilic (S,N,Q) and hydrophobic 
residues (F, M, L, I, V and A). Some of these longer connections are indicated in 
Fig. 3(b) by broken lines. 
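
The network construction can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to draw Fig. 3(b): set membership is reconstructed from the prose, Cys is given a single location, and every pair within two properties is connected, whereas the figure connects only adjacent residues and omits some links for clarity. The names property_distance, build_network and LINE_STYLE are hypothetical.

```python
from itertools import combinations

# ASSUMPTION: set membership reconstructed from the text, not from Fig. 3(a).
SETS = {
    "hydrophobic": frozenset("ACFGHIKLMTVWY"),
    "polar":       frozenset("DEHKNQRSTWY"),
    "small":       frozenset("ACDGNPSTV"),
    "tiny":        frozenset("ACGS"),
    "aromatic":    frozenset("FHWY"),
    "aliphatic":   frozenset("ILV"),
    "charged":     frozenset("DEHKR"),
    "positive":    frozenset("HKR"),
}
RESIDUES = sorted(frozenset().union(*SETS.values()))

# Line styles named in the caption to Fig. 3(b), keyed by number of differing properties.
LINE_STYLE = {0: "heavy bar", 1: "bold line", 2: "fine line"}


def properties(residue):
    """Names of all sets that contain the residue."""
    return frozenset(name for name, members in SETS.items() if residue in members)


def property_distance(a, b):
    """Number of properties held by exactly one of the two residues."""
    return len(properties(a) ^ properties(b))


def build_network(max_diff=2):
    """Map each residue pair (a, b) with a < b to its distance, if within max_diff."""
    edges = {}
    for a, b in combinations(RESIDUES, 2):
        d = property_distance(a, b)
        if d <= max_diff:
            edges[(a, b)] = d
    return edges


if __name__ == "__main__":
    for (a, b), d in sorted(build_network().items(), key=lambda kv: kv[1]):
        print(a, b, LINE_STYLE[d])
```

Because Cys is assigned one location here, it shares a subset with Ala and Gly in this sketch, which goes beyond the three unresolved pairs (Y-W, I-L, A-G) named in the text.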

It is also possible to formally reduce the Venn diagram (or network) to a tree 
structure of the type described by Jimenez-Montano (1984). However, unless the 
corresponding tree contained multiple amino acid entries, information in the Venn
diagram would be lost. 

4. Minimal Set Assignment 

The Venn diagram (Fig. 3(a)) represents a compromise between mutation data 
and chemistry. A similar compromise might have been achieved in a less graphical
way by calculating a distance matrix of the type described by Grantham (1974).
Such tables of relatedness are useful for considering amino acid changes without 
regard for the local environment in the protein structure in which the change occurs 
(e.g. exposure to solvent, secondary structure, etc.) and, consequently, can be applied
uniformly along any pair of sequences to produce an overall measure of relatedness 
[illegible] of amino acids which constitute the assigned set gives a measure of [illegible]
conservation at that point, ranging from one (for absolute conservation of type) to [illegible].

Fig. 4. [illegible] hydrophobic [illegible] polar, where [illegible] non-polar, but as it is a commonly
assigned set it is given the trivial name of [illegible]. [illegible] indicates set union. [illegible]
(c) and (d) indicate the two sets of hydrogen-bond donors (c) and acceptors (d), taken from
Baker & Hubbard (1984). It should be noted, however, that [illegible] Cys are rare and that Met
might possibly receive a hydrogen-bond. [illegible]

charged amino acids in the bend regions of protein structures. Other minor sets
consisting of closely related pairs, including Ser and Thr, Phe and Tyr, and Arg
and Lys, were also added. All these sets are defined in Fig. 5 where they are described 
using a nomenclature adapted from set logic. 

A few example applications of the minimal set assignments to aligned sequence 
fragments are shown in Fig. 6. To give some impression of how set assignment 
varies with the number of aligned sequences and the overall homology of the 
sequences, a progression is shown from a few closely related sequences of a particular 
immunoglobulin domain to an extended alignment of the same domain (Fig. 6). In 
these it is clear that conservation is maintained mainly in the regions of secondary 
structure. This observation can be quantified by plotting the set size at each position 
in the alignment. In Fig. 7 this is done for different numbers of closely related 
sequences of the immunoglobulin κ-chain light-variable domain. Of these sequences
the Bence-Jones protein REI has a known crystallographic structure (Epp et al.,
1975). The strands of β-structure found in this protein lie in two sheets which stack
together like a sandwich (Cohen et al., 1981). One side of each sheet is buried while
the other is exposed to solvent. As the amino acid side-chains in a β-strand alternately
point to either side of the sheet they are consequently alternately buried and exposed
to solvent. This structure is reflected in the degree to which they are conserved in
Fig. 7 where alternately conserved and mutable positions can be seen in the β-strand
regions. The effect is most clearly seen in strands which do not lie on the edge of
the β-sheet. In these the conserved positions are generally hydrophobic.

Plotting the degree of conservation with increasing numbers of aligned sequences 
revealed the interesting observation that many positions rapidly acquire a degree 
of conservation that is unchanged by the addition of further sequences to the 
alignment. Conservation is rapidly lost on aligning the first 20 sequences in order 
of decreasing homology to REI (see Fig. 6(b)) but the addition of another 50 
sequences to the alignment causes little alteration of the conservation profile. 
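
Minimal set assignment and the resulting set-size profile can be sketched as follows. This is a simplified illustration, not the full scheme of Fig. 5: the candidate sets here are only the single residues, the basic sets, their pairwise intersections and the set of all residues ("any"), and membership is an assumption reconstructed from the prose. The alignment is assumed to be gap-free and to use only the 20 standard one-letter codes. The function names are hypothetical.

```python
from itertools import combinations

# ASSUMPTION: set membership reconstructed from the text, not from Fig. 3(a).
BASE = {
    "hydrophobic": frozenset("ACFGHIKLMTVWY"),
    "polar":       frozenset("DEHKNQRSTWY"),
    "small":       frozenset("ACDGNPSTV"),
    "tiny":        frozenset("ACGS"),
    "aromatic":    frozenset("FHWY"),
    "aliphatic":   frozenset("ILV"),
    "charged":     frozenset("DEHKR"),
    "positive":    frozenset("HKR"),
    "negative":    frozenset("DE"),
}
ALL = frozenset().union(*BASE.values())


def candidate_sets():
    """Single residues, the basic sets, their pairwise intersections, and 'any'."""
    cands = {r: frozenset(r) for r in ALL}
    cands.update(BASE)
    for (n1, s1), (n2, s2) in combinations(BASE.items(), 2):
        if s1 & s2:
            cands[f"{n1} and {n2}"] = s1 & s2
    cands["any"] = ALL
    return cands


def minimal_set(column, cands):
    """Name and size of the smallest candidate set containing every residue in the column."""
    residues = frozenset(column)
    # Ties on size are broken in favour of the shortest (simplest) name.
    name = min(
        (n for n, s in cands.items() if residues <= s),
        key=lambda n: (len(cands[n]), len(n), n),
    )
    return name, len(cands[name])


def conservation_profile(alignment):
    """Minimal set assignment for each column of a gap-free alignment."""
    cands = candidate_sets()
    return [minimal_set(col, cands) for col in zip(*alignment)]


if __name__ == "__main__":
    for name, size in conservation_profile(["ILKD", "LLRD", "VLKE"]):
        print(f"{name:12s} {size}")
```

The set size returned for each column runs from one (absolute conservation of type) up to the full set of residues, so plotting it against position gives a profile of the kind described for Fig. 7.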



Fig. 5. Intersection and union of the sets defined in Fig. 6 can produce a vast number of amino acid
[illegible] the inclusive "or" to represent set union. Intersection of two sets is represented by the
[illegible]. Negation is indicated by the prefix "non" and [illegible] it is attached. This produces
recognisable phrases such as [illegible] and [illegible]. Occasionally a commonly used set is given a
"trivial" name: for example, the [illegible] referred to as [illegible] and [illegible] which have rather
long formal names are simplified by reference to individual amino [illegible] included when the two
residues they represent occur in the set. They do not, however, count [illegible] number of members
in the set is considered.

5. Conclusions 

In the rapidly developing field of protein engineering it is important to have a 
measure of amino acid conservation that can be applied to several homologous 
sequences of which at least one has a known tertiary structure. General measures 
of amino acid conservation, such as Dayhoff's likelihood matrix, are best suited to 
situations where there is no structural information about the sequences. Their use 
becomes limiting when applied to local regions of the protein sequence where, for 
structural reasons, the mutational freedom of a particular residue may be greatly 
restrained. With knowledge of the local structure, however, it is possible to analyse 
these restrictions and use them predictively. An example of loss of information by 
averaging, which is apparent in Dayhoff's matrix, is the resistance of cyst(e)ine to 
mutation. This, undoubtedly, arises from the evolutionary need to conserve disulphide 
bonds, but such a restraint does not apply in the reducing intracellular 
environment and to apply it to the comparison of the sequences of cytoplasmic 
proteins is, therefore, misleading. 

The classification of amino acids defined above, and its use in describing sequence 
alignments, allows the type of conservation observed in structural "micro-environ- 
ments" to be rigorously quantified. The important aspect of the approach is that 
not only can the degree of conservation be measured, but the qualitative aspect of 
the conservation is also measured. Together these measures capture virtually all the 
useful information that can be extracted from a number of aligned sequences. Such 
information will be of use in designing new mutants as a protein engineer can 
analyse a sequence alignment to find, for every position, the range of possible amino 
acid changes that might be acceptable. On a wider front, the approach is being 
applied to the analysis of residue conservation in well defined structural motifs 



Fig. 6. (a) Alignment of the variable domains of the immunoglobulin kappa-chains (light-variable 
domain) found in the PIR database. The sequences run down the page and are represented in the one- 
letter amino acid code (insertions are indicated by a bar "|"). The numbers identify the position in the 
sequence of REI and, next to these, heavy bars indicate regions of β-structure. To the right of the 
sequences the amino acid set (see Fig. 5) which best describes the alignment is indicated both by name 
and number. This description is indented in proportion to the number of insertions. The number just to 
the left of the set description is the number of amino acids which constitute the set. This gives a measure 
of specificity and is the number plotted in Fig. 7. Absolutely conserved positions are indicated by the 
residue name only. The set assignments are divided into two regions. The outer is derived from 
the 23 sequences with greatest homology to REI while the inner is derived from the entire set of sequences. 
The sequence names and their relative homology can be found in (b), which is a matrix of homology 
between every pair of sequences. The homology is calculated as a percentage of residue identity matches 
over all matched pairs in two given sequences and is entered in the matrix at a position cross-referenced 
by the two sequence names. For clarity, homologies over 70% are filled solid, those in the 60s are a dot, 
those in the 50s a "5" and those below 50% are blanked out. The entries in the matrix have been ordered 
such that most similar sequences tend to be adjacent on the diagonal. This is achieved by minimising 
the second moment of homology about the diagonal; i.e. 

\[ \sum_{i}^{N} \sum_{j}^{N} H_{ij}\,(i-j)^{2} \rightarrow \min \]

where $H_{ij}$ is the percentage homology between proteins at positions $i$ and $j$ and $N$ is the number of 
proteins. The increments of ten sequences each corresponding to a plot in Fig. 7 are indicated. 
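
The homology measure and the ordering criterion described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`percent_identity`, `homology_matrix`, `second_moment`, `best_order`), the four toy sequences and the exhaustive search over permutations are hypothetical and are not taken from the paper, which does not state how the minimisation was performed.

```python
from itertools import permutations

GAP = "|"  # the caption uses a bar to mark insertions


def percent_identity(a, b, gap=GAP):
    """Percentage of identical residues over positions matched in both sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def homology_matrix(seqs):
    """Square matrix H with H[i][j] = percentage identity of sequences i and j."""
    n = len(seqs)
    return [[percent_identity(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]


def second_moment(H, order):
    """Sum over matrix positions i, j of H * (i - j)**2 for a given ordering."""
    n = len(order)
    return sum(
        H[order[i]][order[j]] * (i - j) ** 2 for i in range(n) for j in range(n)
    )


def best_order(H):
    """Exhaustive minimisation (an assumption); feasible only for very small N."""
    return min(permutations(range(len(H))), key=lambda o: second_moment(H, o))


# Hypothetical pre-aligned toy sequences, not data from the paper.
seqs = ["DIQMTQSP", "DIQMTQTP", "EIVLTQSP", "EIVLTQ|P"]
H = homology_matrix(seqs)
order = best_order(H)
print(order, second_moment(H, order))
```

For the 70 or so sequences of the figure an exhaustive search is impractical, so a heuristic (for example, repeated pairwise swaps that lower the moment) would be needed; that choice is likewise an assumption of this sketch.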



[Fig. 6 artwork: panel (a) shows the aligned SEQUENCES with the set assignments (SETS to 73, SETS to 23); panel (b) shows the pairwise homology matrix, labelled with the names of kappa chain V-region sequences from human, mouse, rabbit and other species, including REI.] 



(facing page 217) 



218 WILLIAM RAMSAY TAYLOR 

(super-secondary structures) commonly found in globular proteins (Sibanda & 
Thornton, 1985; Taylor et al., in preparation). In these structures it is found that 
residue variation is restrained at particular locations in the motifs for general 
structural reasons. The observed patterns of conservation can then be used to predict 
the occurrence of the structural motif in a sequence of unknown structure using 
pattern recognition techniques such as the template matching method of Taylor & 
Thornton (1983, 1984) and Taylor (1986). 

This work was begun while the author was a research fellow at IBM-UK Scientific Centre, 
Winchester, and the author thanks IBM for support [...]. The author is 
currently supported by the SERC as an advanced fellow and thanks Prof. T. L. Blundell and 
Dr J. M. Thornton for valuable discussion. 
Suggestions for "Safe" Residue Substitutions 
in Site-directed Mutagenesis 
The conserved topological structure observed in various molecular families such as globins 
or cytochromes c allows structural equivalencing of residues in every homologous structure 
and defines in a coherent way a global alignment in each sequence family. A search was 
performed for equivalent residue pairs in various topological families that were buried in 
protein cores or exposed at the protein surface and that had mutated but maintained similar 
unmutated environments. Amino acid residues with atoms in contact with the mutated 
residue pairs defined the environment. Matrices of preferred amino acid exchanges were then 
constructed and preferred or avoided amino acid substitutions deduced. Given the 
conserved atomic neighborhoods, such natural in vivo substitutions are subject to similar 
constraints as point mutations performed in site-directed mutagenesis experiments. The 
exchange matrices should provide guidelines for "safe" amino acid substitutions least likely 
to disturb the protein structure, either locally or in its overall folding pathway, and most 
likely to allow probing of the structural and functional significance of the substituted site. 



1. Introduction 

Site-directed mutagenesis has become a very 
important and yet facile tool to explore the 
structural and functional significance of particular 
residues within proteins (for example, see Knowles, 
1987; Shaw, 1987; Gruetter et al., 1987). A typical 
experiment would involve substitutions of an amino 
acid thought to be essential for catalysis and then 
assaying the resultant variant for activity. It is 
central to the success of these experiments that 
disturbance of the protein fold and structural 
characteristics, locally as well as globally, be kept to 
a minimum; otherwise the loss of activity, for 
instance, would be a result of conformational 
changes and the exchanged residue be improperly 
identified as catalytic. Residue substitutions, where 
the latter situation does not occur, can be con- 
sidered as "safe". 

Natural evolution has "engineered" protein struc- 
tures by modifying certain molecular properties 
such as substrate specificity or surface charges and 
yet conserved the global protein topology. By 
comparing known conserved three-dimensional pro- 
tein structures it is possible to glean hints about 
how this process was performed (Lesk & Chothia, 

0022-2836/91/040721-09 $03.00/0 



1980, 1982; Chothia & Lesk, 1986; Bashford et al., 
1987); rules obtained in this way are useful for 
designing site-directed mutagenesis experiments. 
Protein engineering in the laboratory often faces 
similar trials. For example, suppose that charges on 
a protein surface are to be altered to construct a 
cation binding site. Which amino acids near the 
surface would be safer to substitute to achieve the 
desired charge configuration? 

In this work residue exchange matrices are calcu- 
lated that represent point mutational preferences as 
observed in homologous and known three- 
dimensional protein structures. Alignments of 
primary sequences determined from spatial superposition 
of the main-chain Cα atoms and taken from nine 
molecular families allowed identification of structurally 
equivalent residues in each of the familial 
sequence sets. A search was then performed for 
equivalent residues that had mutated but maintained 
similar unmutated environments defined by 
those atoms in contact with the central residue 
pairs. Such point mutations as observed in known 
tertiary structures are likely to be, with present-day 
knowledge, the closest possible mimic of in vivo 
site-directed mutagenesis. 

Residue exchange statistics and their significance 
were determined for all the structural equivalents in 
the various molecular families. The preferred and 
avoided substitutions were elicited from three struc- 
tural contexts: buried residues, amino acids exposed 
beyond some water-accessible surface area thres- 
hold, and then all cases regardless of accessible 
state. These exchange matrices should provide con- 
siderable aid in the difficult process of deciding 
which residue to exchange and then with which 
amino acid it should be substituted to maintain 
protein structural integrity. The preferred 
exchanges are also discussed in terms of residue 
physicochemical characteristics. 



2. Data and Methods 

(a) Aligned structures 

Aligned sequence sets were taken from 9 molecular 
families: globins, immunoglobulins, cytochromes c, serine 
proteases, subtilisins, calcium binding proteins, acid 
proteases, toxins, and virus capsid proteins. The total 
number of sequences, each with known 3-dimensional 
structure as contained in the 1989 Brookhaven database 
collection (Bernstein et al., 1977), was 55. Table 1 lists 
their database code identification, protein name, species, 
reference for the 3-dimensional structure, and, where 
present, reference in which the alignment of the familial 
sequences used here was determined. The alignments were 
generally achieved by careful examination of the X-ray 
crystallographic structures coupled with spatial superposition 
of the main-chain Cα atoms (Rossmann & Argos, 
1981). In 3 cases (calcium binding proteins, acid proteases 
and toxins) structures were superimposed by the present 
authors using the technique of Rossmann & Argos 
(Rossmann & Argos, 1976, 1977; Argos & Rossmann, 
1979). Due to the increasing number of solved protein 
structures, many of those used in the present work 
extracted from the 1989 release of the Brookhaven data- 
base were not included in the references showing the 
familial alignments. These further sequences, indicated by 
an asterisk in Table 1, were aligned by the authors to the 
closest family member in both sequence and structure. 

When considering statistics for buried residues (solvent- 
accessible surface area below an upper limit), both 
constant and variable domains were utilized from the 
immunoglobulins. However, the variable regions were 
excluded from the exchange matrix statistics involving 
surface-exposed amino acids, since large segments of the 
variable domain loops bind antigens and therefore are 
subject to special constraints. For a similar reason, side- 
chains contributing to subunit interface or cofactor 
contacts were not included in the substitution 
calculations. 



(b) Similarity of environment 

In a previous paper, Bordo & Argos (1990) carefully 
defined a measure of similarity (see S" as given by them in 
eqns (1) and (3)) between 2 atomic environments 
surrounding structurally equivalent residues. The same 
measure is used here. An environment or neighborhood 
for a residue (called a central residue) is defined by the 
number of atoms and amino acid types that are within 
4.5 Å (1 Å = 0.1 nm) of any side-chain atom in the 



surrounded residue. The similarity score $S$ is expressed as 
a fraction and is defined as: 

\[ S = \frac{\sum_i \left( b_i + \delta_i s_i \right)}{\sum_i \left( b_i + s_i \right)} \qquad (1) \]

The denominator is simply the mean number of atoms 
belonging to residues present in at least 1 of the 2 environments 
($b_i$ = main-chain atoms, $s_i$ = side-chain atoms). The 
mean refers to the 2 sets of atoms in each of the 2 
environments. The numerator is the sum of the mean 
number of all main-chain atoms in the 2 environments 
regardless of the mutational state of the equivalent neighborhood 
residues plus the mean number of side-chain 
atoms $s_i$ from residues that touch at least 1 atom of the 
mutated central residues (i.e. within 4.5 Å). The term $\delta_i$ is 
0 if the $i$th residue is mutated and 1 if identically 
conserved. $\sum_i$ is over all residues that touch at least 1 of 
the central residues. Therefore, similarity of 2 environ- 
ments will be diminished only if there are mutations in 
the equivalent environmental residues. That is, if struc- 
turally equivalent residues forming the neighborhood of a 
central residue in 1 protein structure are conserved in the 
other structure despite their absence in the neighborhood 
of the equivalent central residue in the latter structure, 
the similarity score is not decreased. This allows for cases 
where contacts made by the substituted central residue 
with its neighbors change only in consequence of its 
change in size and shape. For instance, environmental 
residues can move considerably to accommodate a small 
residue changing to a large one. Though the side-chains in 
contact with the larger residue are not in contact with the 
small one, they are nonetheless available without 
mutation to make contact as necessitated by the substituted 
residue. Water-accessible surfaces of the combined 
main-chain and side-chain for each residue were calculated 
by the procedure of Kabsch & Sander (1983). 
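
As a rough illustration of the similarity score described in this subsection, the Python sketch below counts main-chain atoms for every neighborhood residue and side-chain atoms only for residues that are identically conserved. The tuple layout, the function name `similarity_score` and the example atom counts are assumptions made for this sketch; the measure itself is the one defined by Bordo & Argos (1990).

```python
def similarity_score(neighbours):
    """Fraction of environment atoms that is unaffected by neighbour mutations.

    neighbours: iterable of (main_chain_atoms, side_chain_atoms, conserved)
    tuples, one per residue touching at least one of the central residues.
    The atom counts are taken to be means over the two environments.
    """
    neighbours = list(neighbours)
    # Main-chain atoms always count; side-chain atoms count only if conserved.
    numerator = sum(b + (s if conserved else 0.0) for b, s, conserved in neighbours)
    denominator = sum(b + s for b, s, _ in neighbours)
    return numerator / denominator if denominator else 0.0


# Hypothetical environment of three neighbouring residues, one of them mutated.
example = [(4, 3, True), (4, 5, False), (2, 1, True)]
print(round(similarity_score(example), 3))
```

A fully conserved neighborhood scores 1, and each mutated neighbor lowers the score in proportion to the side-chain atoms it contributes.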

(c) Statistical significance of exchanges 

Counts were made for every observable substitution of 
central residues with similar neighborhood at a preset 
similarity threshold. To give statistical significance to 
these figures, a comparison between observed and 
expected number of substitutions was performed under 
the following hypothesis. Consider a pool of $N$ amino 
acids, $N = \sum_i n_i$ ($i$ = 1 to 20), where the $i$th amino acid 
type appears $n_i$ times. The exchange is a directed 
replacement of the amino acid $i$ with the amino acid $j$ (e.g. 
Ala → Asp) and substitution $i$-$j$ refers to either $i \to j$ or 
$j \to i$ (e.g. Ala → Asp or Asp → Ala). There are $N(N-1)$ 
possible exchanges in the pool, of which $\sum_i n_i(n_i - 1)$ are 
between residues of the same kind. Therefore, 
$N' = N(N-1) - \sum_i n_i(n_i - 1)$ is the number of possible 
exchanges involving pairs of different residues. Since the 
observed mutations refer to only substituted residues, $N'$, 
and not $N$, represents the pool of available exchanges. 
The probability $p_{i \to j}$ is then given by $n_i n_j / N'$, and the 
probability to observe a substitution becomes: 

\[ p_{i-j} = \frac{2\, n_i n_j}{N'} \qquad (2) \]

Given a total number of $X$ observed substitutions, the 
expected number of substitutions $n_{i-j}$ is therefore $X p_{i-j}$. 
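
The expected-count calculation can be checked numerically with a short Python sketch. The function names (`possible_exchanges`, `pair_probability`, `expected_substitutions`) and the uniform test pool of 10 residues per amino acid type are assumptions chosen for illustration, not the authors' code or data.

```python
AMINO_ACIDS = [
    "Gly", "Ala", "Ser", "Pro", "Asp", "Cys", "Asn", "Thr", "Glu", "Val",
    "Gln", "His", "Met", "Leu", "Ile", "Lys", "Arg", "Phe", "Tyr", "Trp",
]


def possible_exchanges(counts):
    """N' = N(N - 1) - sum_i n_i(n_i - 1): exchanges between different types."""
    n_total = sum(counts.values())
    same_kind = sum(n * (n - 1) for n in counts.values())
    return n_total * (n_total - 1) - same_kind


def pair_probability(counts, i, j):
    """p_{i-j} = 2 n_i n_j / N' for the undirected substitution i-j."""
    return 2 * counts[i] * counts[j] / possible_exchanges(counts)


def expected_substitutions(counts, i, j, total_observed):
    """Expected number of i-j substitutions among X observed substitutions."""
    return total_observed * pair_probability(counts, i, j)


# Hypothetical uniform pool: 10 residues of each of the 20 types.
pool = {aa: 10 for aa in AMINO_ACIDS}
print(possible_exchanges(pool))
print(expected_substitutions(pool, "Ala", "Thr", 1000))
```

Summing `pair_probability` over all unordered pairs of different types gives 1, which is a convenient sanity check on the definition of $N'$.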

The population $n_i$ ($i$ = 1 to 20) was calculated in the 
following manner. Given a set of structurally aligned 
sequences for a particular molecular family, each alignment 
column would generally contain several amino acid 
types. The count for the population $n_i$ ($i$ = 1 to 20) was 

Table 1 

Tertiary structures used in this work 

Family / BRK†  Protein                  Origin                Structure reference            Alignment reference‡

Hemoglobin                                                                                   Lesk & Chothia (1980)
  4HHB         Hemoglobin               Human                 Fermi et al. (1984)
  2MHB         Hemoglobin               Equine                Ladner et al. (1977)
  1FDH         Gamma globin             Human                 Frier & Perutz (1977)          *
  1MBD         Myoglobin                Whale                 Phillips (1980)
  1MBS         Myoglobin                Seal                  Scouloudi & Baker (1978)       *
  2LHB         Hemoglobin V             Sea lamprey           Hendrickson et al. (1973)
  1ECA         Erythrocruorin           Chironomus            Steigemann & Weber (1979)
  2LH1         Leghemoglobin            Lupin                 Vainshtein et al. (1977)

Immunoglobulins                                                                              Amzel & Poljak (1979)
  1FB4         FAB Kol                  Human                 Marquart et al. (1980)         *
  1FBJ         FAB IgA                  Mouse                 Navia et al. (1979)            *
  1FC1         Fc IgG1                  Human                 Deisenhofer (1981)             *
  1FC2         Fc                       Human                 Deisenhofer (1981)             *
  1IG2         Fc Kol                   Human                 Marquart et al. (1980)         *
  1MCP         FAB                      Mouse                 Segal et al. (1974)
  1PFC         Fc IgG1                  Porcine               Bryant et al. (1986)           *
  1REI         FAB Bence-Jones          Human                 Epp et al. (1975)              *
  2RHE         FAB Bence-Jones          Human                 Furey et al. (1983)            *
  3FAB         FAB New                  Human                 Saul et al. (1978)
  2HFL         FAB IgG1                 Mouse                 Sheriff et al. (1987)
  1F19         FAB                      Mouse                 Lascombe et al. (1989)         *

Cytochromes c                                                                                Dickerson (1980)
  155C         Cytochrome c550          Paracoccus D          Timkovich & Dickerson (1976)
  3C2C         Cytochrome c2            Rhodospirillum R      Salemme et al. (1973)
  4CYT         Cytochrome c             Bonito fish           Takano & Dickerson (1980)
  1CYC         Ferrocytochrome c        Tuna fish             Tanaka et al. (1975)           *
  1CCR         Cytochrome c             Rice                  Ochi et al. (1983)             *
  451C         Cytochrome c551          Pseudomonas A         Matsuura et al. (1982)

Serine proteases                                                                             Craik et al. (1983)
  2SGA         Proteinase A             Streptomyces G        Moult et al. (1985)
  3SGB         Proteinase B             Streptomyces G        Read et al. (1983)
  2ALP         Alpha-lytic protease     Lysobacter E.         Fujinaga et al. (1985)
  4CHA         Alpha chymotrypsin       Bovine                Tsukada & Blow (1985)
  3PTB         Beta trypsin             Bovine                Marquart et al. (1983)
  2TRM         Trypsin                  Rat                   Sprang et al. (1987)           *
  1TON         Tonin                    Rat                   Fujinaga & James (1987)        *
  2KAT         Kallikrein               Porcine               Bode et al. (1983)
  1SGT         Trypsin                  Streptomyces G        Read & James (1988)            *
  3EST         Elastase                 Porcine               Meyer et al. (1988)            *
  3RP2         Mast cell protease       Rat                   Remington et al. (1988)

Subtilisins                                                                                  Froemmel & Sander (1989)
  1SBT         Subtilisin               B. amyloliquefaciens  Alden et al. (1971)
  2PRK         Proteinase K             Fungus                Paehler et al. (1984)
  1CSE         Subtilisin Karlsberg     B. subtilis           Bode et al. (1987)

Calcium binding proteins
  3CLN         Calmodulin               Rat                   Babu et al. (1988)             *
  3CPV         Ca-binding parvalbumin B Carp                  Moews & Kretsinger (1975)      *
  3ICB         Ca binding protein       Bovine                Szebenyi & Moffat (1986)
  1TNC         Troponin C               Chicken               Satyshur et al. (1988)         *

Acid proteases
  2APP         Penicillopepsin          Fungus                James & Sielecki (1983)
  2APR         Rhizopuspepsin           Mold                  Suguna et al. (1987)
  4APE         Endothiapepsin           Fungus                Pearl & Blundell (1984)

Toxins
  1CTX         Alpha cobratoxin         Cobra                 Walkinshaw et al. (1980)
  1NXB         Neurotoxin B             Sea snake             Tsernoglou et al. (1978)
  2ABX         Alpha bungarotoxin       Krait                 Love & Stroud (1986)

Viruses
  2TBV         Tomato bushy stunt       Virus                 Hopper et al. (1984)           Rossmann et al. (1983)
  4SBV         Southern bean mosaic     Virus                 Silva & Rossmann (1985)
  2STV         Satellite tobacco necr.  Virus                 Jones & Liljas (1984)
  1MEV         Mengo                    Virus                 Luo et al. (1987)              Luo et al. (1987)
  4RHV         Rhino                    Virus                 Arnold & Rossmann (1988)       Luo et al. (1987)

† The column labeled BRK gives the Brookhaven database entry name (Bernstein et al., 1977). 

‡ References showing structural sequence alignments used in this work. An asterisk refers to the cases where the structural alignment 
was performed by the authors. 

Table 2 

Residue counts for the nine structural 
protein families 

Residue type   Buried†   Exposed‡   All§
Gly            161       226        445
Ala            182       250        515
Ser            108       375        533
Pro             34       194        249
Asp             28       255        315
Cys             38        23         71
Asn             33       258        313
Thr             79       341        477
Glu             11       239        255
Val            206       166        415
Gln             26       201        248
His             20        69        105
Met             49        47        107
Leu            165       135        331
Ile            125       104        265
Lys              5       297        320
Arg              9       162        193
Phe             89        88        208
Tyr             38       128        191
Trp             30        33         68

† Residues having solvent-accessible surface less than or equal 
to 10 Å². Counts are performed as described in Data and 
Methods. 

‡ Residues having solvent-accessible surface more than or 
equal to 30 Å². Counts are performed as described in Data and 
Methods. 

§ All residues are counted, regardless of their exposure to 
solvent. 



increased by 1 only once for each amino acid type in the 
alignment column, regardless of its number of appearances. 
This was consistent with the counts for redundant 
central residue pairs. For instance, suppose an alignment 
position contained 3 Ala and 2 Gly residues in a particular 
topological family; a total of 6 residue substitutions can be 
counted. However, since they are all structurally equivalent, 
only 1 should be taken; namely, that Gly-Ala substitution 
with the highest environmental similarity score. 
This selection is consistent with the aim of this study to 
find conserved neighborhoods tolerating mutant central 
residues. Total counts $n_i$ ($i$ = 1 to 20) were determined for 
all the alignment positions in all the molecular families 
under 3 water-accessible conditions and are given in Table 
2. The probability to observe $a$ substitutions $i$-$j$ out of $X$ 
trials taken from a pool of $N$ residues ($N = \sum_i n_i$), 
assuming a binomial distribution, is given by: 

\[ P_{i-j}(X, a) = \binom{X}{a}\, p_{i-j}^{\,a}\, (1 - p_{i-j})^{X-a} \qquad (3) \]

where $p_{i-j}$ is given in eqn (2), and: 

\[ \binom{X}{a} = \frac{X!}{a!\,(X-a)!} \]

Given the number of observed substitutions $n_{i-j}$, it is 
straightforward to calculate its chance probability with 
eqn (3) (see e.g. Korn & Korn, 1968). If the sum of all 
probabilities $P_{i-j}(X, a)$ for $n_{i-j} \le a \le X$ is less than or equal 
to 0.05, the preference of the substitutions can be considered 
significant at the 95% confidence level or better. 
Consider the following hypothetical illustration. Suppose 
the pool of residues consisted of 10 amino acids for each of 


Table 3

Number of substitutions for buried residues involving
volume and polarity alterations

Similarity (%)†                                      100    95    90    85    80
Observed substitutions                                12    34    65   124   206
Total number with volume change > 1 methyl group             1     9    24    57
Total number with polarity group change                2     2    14    33    63
Hydrophobic/hydrophilic substitutions                               1     1     1

† Percentage similarity threshold of central residue
environments (see eqn (1)).

the 20 types ($n_i$ = 10, $i$ = 1 to 20); then $N$ = 200 and the
number of possible non-identical amino acid exchanges
is:

(200 × 199) − 20 × (10 × 9) = 38,000.

If, for instance, 1000 substitutions are observed ($X$ = 1000),
the expected $n_{i-j}$, using eqn (2), is 2 × 1000 × 10
× 10/38,000 ≈ 5. Assume that for a given pair $i$-$j$ (e.g.
Ala-Thr) the observed number of substitutions $n_{Ala-Thr}$ is
12; then if

$$P_{Ala-Thr}(1000, 12) + P_{Ala-Thr}(1000, 13) + \dots + P_{Ala-Thr}(1000, 1000) \le 0.05$$

the substitution preference between Ala and Thr can be
considered significant with at least 95% confidence.
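
The hypothetical illustration above can be checked numerically. The following is a minimal Python sketch, not the authors' program: the function names are hypothetical, and the form of $p_{i-j}$ (twice the product of the two pool counts divided by the number of possible non-identical exchanges) is read off the worked illustration as an assumption about eqn (2).

```python
from math import comb

def exchange_probability(n_i, n_j, pool_counts):
    """Assumed form of p_{i-j}: chance of drawing an i-j exchange from the pool."""
    N = sum(pool_counts)
    # Ordered pairs of distinct residues, minus ordered pairs of identical type.
    possible = N * (N - 1) - sum(n * (n - 1) for n in pool_counts)
    return 2 * n_i * n_j / possible

def tail_probability(X, observed, p):
    """Sum of the binomial terms P(X, a) of eqn (3) for observed <= a <= X."""
    # Computed as 1 minus the lower tail so that only `observed` terms are summed.
    lower = sum(comb(X, a) * p**a * (1 - p) ** (X - a) for a in range(observed))
    return 1.0 - lower

pool = [10] * 20                       # 10 residues for each of the 20 types
p = exchange_probability(10, 10, pool) # any pair of distinct types, e.g. Ala-Thr
expected = 1000 * p                    # expected count for X = 1000 substitutions
tail = tail_probability(1000, 12, p)   # chance of seeing 12 or more exchanges
significant = tail <= 0.05
```

With these inputs the tail sum falls below 0.05, so the hypothetical Ala-Thr preference would be flagged as significant under the stated criterion.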

3. Results and Discussion 

Table 2 lists the residue population for each of the
amino acids in the three structural states examined
for central residue substitutions: (1) buried in the
protein core (solvent-accessible surface for both
residues ≤10 Å²); (2) exposed (solvent-accessible
surface area ≥30 Å²); and (3) all the possible accessibility
states allowed. The residue pool represents,
under the constraints discussed in Data and
Methods, the composition of amino acids available
for possible substitutions. These populations are
important in calculating the statistical significance
of the substitutions (see Data and Methods).

In a previous paper (Bordo & Argos, 1990),
substitution statistics were gathered from only one
sequence family (globins) and for only buried
residues. The buried exchange counts given here
increased by at least a factor of 5 from the addition
of eight sequence families (Table 1). The basic
trends observed were nonetheless conserved. The
results in Table 3 make this salient. Very few of the
total substitutions show volume changes greater
than one methyl group (35 Å³) and a movement
(referred to as a "jump") to another polarity group
(Grantham, 1974), where the three possible groups
are defined (1-letter code used) by (WYFMCILV),
(PATGS) and (HKRQDEN). These constraints
imply considerable impact on the development of
protein cores in structures maintaining main-chain
fold; a detailed discussion can be found in the earlier
work (Bordo & Argos, 1990). All ensuing work given
here is unique to this report.
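
The polarity "jump" criterion can be expressed compactly. The sketch below is illustrative only: the three groups are transcribed from the text, while the function names are hypothetical. The volume criterion is not covered because side-chain volumes are not tabulated in this passage.

```python
# The three polarity groups quoted in the text (1-letter amino acid codes).
POLARITY_GROUPS = ("WYFMCILV", "PATGS", "HKRQDEN")

def polarity_group(residue):
    """Return the index (0, 1 or 2) of the polarity group containing the residue."""
    for index, members in enumerate(POLARITY_GROUPS):
        if residue in members:
            return index
    raise ValueError(f"unknown residue code: {residue!r}")

def is_polarity_jump(res_a, res_b):
    """True if the substitution moves the residue to another polarity group."""
    return polarity_group(res_a) != polarity_group(res_b)
```

For example, `is_polarity_jump("V", "I")` is false because both residues belong to the first group, whereas `is_polarity_jump("A", "K")` is true.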
Table 4

Number of substitutions for exposed residues
involving volume and polarity alterations

Similarity (%)†                                      100    95    90    85    80
Observed substitutions                               100   152   322   500   941
Total number with volume change > 1 methyl group      28    54   124   268   406
Total number with polarity group change               42    69   153   28U   547
Hydrophobic/hydrophilic substitutions                  3     5    19    39    78

† Percentage similarity threshold of central residue
environments (see eqn (1)).



Table 4 lists similar statistics (volume and 
polarity group alteration counts) for exposed 
residues with similar environments. It is clear that
they display considerable point mutation freedom 
compared to the buried residues. Approximately 
one-third to one-half of the substitutions (depending 
on the percentage similarity of the neighborhood) 
involve changes in polarity group or volume altera- 
tions greater than one methyl group, whereas only 
about 15% of the buried substitutions involved 
such changes. However, few side-chains (~5%) 
alter the sign of their charge or jump (~3%) 
between opposite polarity groups (i.e. hydrophobic- 
hydrophilic) despite their exposure. 

It was required that each of the two substituted
residues have a water-accessible surface area of at 
least 30 A 2 to be deemed exposed. This represents 
approximately a hole just large enough for a methyl 
group to pass through and was found from the 
previous globin statistics (Bordo & Argos, 1990) as 
well as the present data (not shown) to be the 
minimal exposure at which radical volume and 
polar alterations between exchanged central 
residues are observed. 

Figure 1 shows the actual exchange counts for (a) 
buried, (b) exposed and (c) all cases where the 
central residue environments were 90% or greater 
(lower matrix half) and 70% or greater (upper 
matrix half) in similarity. The symbols plus 
(preferred exchange) and minus (avoided exchange) 
are shown in the upper half of matrices if the counts 
were reliable at the 95% confidence level or better 
as well as consistently preferred or shunned for at 
least two similarity levels within a range of 100% to 
70% calculated in steps of 5%. As expected, the 
70% similarity data produced the most observed 
exchange counts and the greatest number of substi- 
tutions deemed significant. However, given the 
lessened neighborhood similarity, noise is increas- 
ingly introduced; nonetheless, trends are preserved 
from the 90% to 70% levels (Fig. 1). 
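
The rule for assigning a plus or minus symbol can be sketched as follows. This is one possible reading of the criterion, not the authors' code: the function name and the input layout (one tuple of observed count, expected count and tail probability per similarity level) are assumptions made for illustration.

```python
def exchange_symbol(levels, alpha=0.05):
    """Return '+', '-' or '' for one residue pair.

    levels: list of (observed, expected, p_value) tuples, one per similarity
    threshold (for example 100% down to 70% in steps of 5%).
    """
    signs = []
    for observed, expected, p_value in levels:
        if p_value <= alpha:
            # Significant level: record whether the exchange is above or
            # below the chance expectation.
            signs.append("+" if observed > expected else "-")
    # Require at least two significant levels that all point the same way.
    if len(signs) >= 2 and len(set(signs)) == 1:
        return signs[0]
    return ""
```

Under this reading a pair that is significant at only one level, or preferred at one level and avoided at another, receives no symbol.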

Several interesting substitution trends are observable
in the Figure 1 exchange matrices. Though the
high count substitutions are not always deemed
statistically significant, they represent a useful
starting point in deciding which substitutions to try
in structure-altering experiments such as site-directed
mutagenesis or protein engineering. It will take
considerable time and effort to produce sufficient
X-ray crystallographic protein structures to determine
the significance of all the possible
substitutions.

For the protein core, residues within each of the
following subsets are generally interchangeable with
high statistical significance: (A,G), (A,V), (N,D),
(M,L), (F,L), (F,Y), (A,S,T), (V,I,L) and (Y,W).
This is shown diagrammatically in Figure 2. In an
examination of the counts alone, surprising results
can be found for many of the amino acid types.
While Thr can exchange with Ala and Ser, Asn is
the next most desirable. Cys prefers Ala or Val as
substitutes. Though Val can rather freely go to Ala,
Ile and Leu, Ile prefers primarily only Val and Leu.
Met and Phe favor Leu, rather than Ile, as an
ersatz. For exposed substitutions unexpected results
are also in evidence. Gly prefers Asn as the most
desirable charged or polar substitute. If Ala must be
replaced by a charged residue, Lys and Glu are
statistically favored. Ser prefers Asp and Asn and
not Glu, Lys or Arg, while Thr is the most favored
substitute. Asp especially avoids Tyr at the surface.
Val's favorite partners are Ile and Leu, while Tyr
prefers Phe. Interestingly, the hydrophobic residues
Val, Leu and Ile tend to substitute amongst themselves
despite some exposure at the surface. If an
exposed Val must be changed to a charged residue,
Lys is the best candidate; and so forth.

Some substitutions are consistently allowed
regardless of exposure or buriedness (Fig. 2). Among
the highly significant preferred exchanges, in single
letter code, are (G,A), (S,A), (T,A), (N,D), (T,S),
(V,I,L) and (F,Y).
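
As a small illustration of how these subsets might be consulted when planning a mutant, the following sketch stores them as tuples and looks up the partners of a given residue. The subsets are transcribed from the two preceding paragraphs in one-letter code; the variable and function names are hypothetical, and the lookup ignores the actual counts in Figure 1.

```python
# Subsets reported as interchangeable in the protein core.
BURIED_SUBSETS = [
    ("A", "G"), ("A", "V"), ("N", "D"), ("M", "L"), ("F", "L"),
    ("F", "Y"), ("A", "S", "T"), ("V", "I", "L"), ("Y", "W"),
]

# Exchanges reported as preferred regardless of exposure.
ANY_STATE_SUBSETS = [
    ("G", "A"), ("S", "A"), ("T", "A"), ("N", "D"),
    ("T", "S"), ("V", "I", "L"), ("F", "Y"),
]

def preferred_partners(residue, subsets):
    """Collect every residue sharing at least one subset with `residue`."""
    partners = set()
    for subset in subsets:
        if residue in subset:
            partners.update(subset)
    partners.discard(residue)
    return sorted(partners)
```

For a buried Leu, `preferred_partners("L", BURIED_SUBSETS)` gives Phe, Ile, Met and Val; a residue that appears in no subset, such as Cys, returns an empty list, in which case the raw counts would have to be consulted instead.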

Calculating the logarithm of the ratio of the
observed to expected counts for each possible
substitution and for all observed cases having 70%
environmental similarity (Fig. 1(c), upper right
matrix), it was possible to build a scoring matrix
analogous to that determined by Dayhoff et al.
(1978). The correlation coefficient between the
elements of the two matrices was 0.64. It would not
be expected that the two matrices correlate well as
the results of this work concern single substitutions
over only close molecular generations, while the
Dayhoff et al. observations are cumulative over
many and multiple mutations.
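
The scoring and comparison steps can be sketched in a few lines. This is a hedged illustration rather than the procedure actually used: the natural logarithm, the treatment of zero counts (returned as `None`) and both function names are assumptions, since the text does not specify them.

```python
import math

def log_odds(observed, expected):
    """Logarithm of the observed-to-expected ratio for one substitution."""
    if observed == 0 or expected == 0:
        return None  # undefined score; handling of empty cells is an assumption
    return math.log(observed / expected)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In use, the corresponding off-diagonal elements of the two matrices would be flattened into two lists, skipping any pair with an undefined score, and passed to `pearson`.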

The matrices listing preferred or safe and avoided 
or unsafe substitutions taken from actual tertiary 
structures should prove exceedingly useful in site- 
directed mutagenesis and protein engineering 
experiments. It would be helpful to ascertain if a 
residue is exposed or buried before choosing a 
substitution. If the protein three-dimensional struc- 
ture is known, this information is evident. If only 
the sequence has been determined, secondary structure
prediction and/or a hydrophobicity plot (for a
review, see Argos, 1990) should provide a good guess 
as to the appropriate solvent-accessible state of the 
residue in question. If not, the exchange counts 
taken from all residues in the familial sequence sets 
are given in Figure 1(c). 
Figure 1. Observed substitutions for (a) buried, (b) exposed and (c) all cases. The lower halves of the matrices give
substitution counts for central residues with 90% or greater similar environments, while the upper halves are for 70% or
greater similarity. When counts show a statistically meaningful (95% or greater confidence) increase or decrease
compared to the expected figures for at least 2 similarity levels ranging from 100% to 70% in steps of 5%, with the
trend being consistent, a + or − sign is given to indicate preferred or avoided substitutions, respectively. In the exposed
data, immunoglobulin variable domains were not included.

Figure 2. Statistically preferred (95% or greater confidence
level as indicated by a + in Fig. 1) substitutions
observed in buried residues (grey segments) and exposed
residues (black segments) are shown. Residues roughly
equivalent are grouped together in 5 subsets, which
generally correlate with side-chain physicochemical
properties.



Lim & Sauer (1989) have performed mutation 
experiments on λ repressor protein core side-chains
and the mutants were assayed for functionality and 
stability. Interestingly, all of the single protein core 
mutants could have been predicted from this work 
(Bordo & Argos, 1990). 

Site-directed mutagenesis is an important tool in 
probing the structural and functional significance of 
particular residues within a protein sequence (for 
reviews, see Knowles, 1987; Shaw, 1987). Amino 
acid residues might be altered to check for their 
participation in catalysis, cofactor or substrate
binding, molecular and receptor recognition, 
domain interfaces, oligomeric interactions, and the 
like. It is essential in such experiments that the 
protein fold, locally and globally, not be perturbed; 
otherwise, loss of activity or whatever aspect is
under study would be incorrectly ascribed to the 
mutated residue. "Safe" substitutions are thus 
requisite for the success of the mutant probe as an 
indicator of critical residues in structure and func- 
tion. This work provides exchange matrices that 
should be directly applicable in maintaining the fold 
and that are taken from known three-dimensional 
protein structures with diverse folds. Of course, the 
results represent general trends and cannot be 
expected to work in every local context, but they 
should be a great improvement over randomly 
selected substitutions and act as a good guide 
regarding what to substitute and what not to 
substitute. For example, suppose Cys were a 
suspected active site residue. If exposed or buried, 
though the substitution data base is not sufficient to 
identify statistically significant exchanges for Cys, 
the observed substitution counts would recommend
Ala; if the Cys is likely to be buried, Val is
also a possible candidate. 



Zvelebil & Sternberg (1988) examined several 
known tertiary structures and determined that His 
is the most frequently occurring catalytic residue. 
Assuming its exposure to the solvent, the exchange 
matrix suggests Ser as the safest substitution. In 
the review by Shaw (1987) on specific point 
mutations for several molecular species, the Gly- 
Ala substitution is one of the most frequent 
mentioned. Apparently the proteins maintained 
their fold while proven assays displayed altered 
activity. The exchange matrices presented in this 
work suggest the Gly-Ala substitution as highly 
significant in the buried or exposed states. 

In protein engineering as well as molecular 
modeling, where new structures are built from those 
with known tertiary and homologous primary structures
(for a review, see Sali et al., 1990), it is often
crucial to know which residues can be substituted 
safely. Can a substituted residue in a molecular 
model be placed in the same environment displayed 
by the known native structure? For instance, if a 
His is to be introduced in an exposed loop to eng- 
ineer cation binding, would it be safer to substitute 
a Ser, Glu, Asn or Lys in the known structure? The 
exchange matrices of Figure 1 provide direct 
answers. In fact, Sali et al. (1990) in their review on 
modeling cite only two specific examples where 
residues are allowed limited choices due to folding 
requirements. Both involve constrained Ser-Thr 
substitutions in buried β-strands where the side-chain
oxygen atoms bond to main-chain atoms.
Among the preferred exchanges, the Ser-Thr one 
is highly preferred both in the exposed and 
buried substitutions matrices reported here (Fig. 2). 
A further protein engineering example would 
involve a desired residue substitution to stabilize a 
predicted or known helix. The exchange should be 
from a residue of lower to higher helical preference 
(Palau et al., 1982). Combining this requirement 
with the exchange matrix counts of Figure 1 should 
provide a very rational substitution, especially if 
the tertiary structure is not known, which is typically
the situation. For example, if Ile were buried
and part of a helix is to be stabilized, the matrix of 
Figure 1(a) suggests Leu and then Met as likely 
substitution candidates. 

Malcolm et al. (1990) have published results of
mutants of game bird lysozymes. Point mutations
on in vivo triplets Thr40-Ile55-Ser91 (TIS) or
Ser40-Val55-Thr91 (SVT) included, respectively, TVS,
SIS, TIT and SVS, SIT, TVT. The mutants were
assayed for thermal stability and it was found that
TIT, SIT and TVT were more stable than the
respective wild-type and TVS, SIS and SVS less so.
The buried-residue exchange matrices in this work
would predict that Val → Ile and Ser → Thr would be
ideal substitutions to preserve main-chain fold and
enhance thermal stability under the assumption
that increasing the volume of a side-chain within
one methyl group would result in better hydrophobic
packing to maintain the protein structure.
In every case, this is exactly what occurred experimentally.
In fact, when the exchange from the wild-type
involved a volume decrease, the fold was
maintained but thermal stability diminished.

The authors thank Gareth Chelvanayagam, Jaap 
Heringa and Peter Sibbald for many helpful discussions. 